Descriptive Statistics

Author

James Van Slyke

Opening datasets

Pre-existing datasets often need to be imported into R Studio. Several different types of datasets can be imported; the ones used most frequently in this course come from Excel and SPSS. The type of a computer file is indicated by the extension that follows the period (“.”) in the file name. For example, file.docx is a Word file.

Here’s a list of common file types we’ll use.

  • Excel = .xlsx
  • SPSS = .sav
  • Comma Separated Values = .csv
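Because the file type is whatever follows the period, R can read it straight off a file name. The lines below are a minimal sketch using file_ext from the tools package that ships with R. Only “Album Sales.sav” comes from the course; the other two file names are made up for illustration.

```r
# File names to inspect. Only "Album Sales.sav" is a course file;
# the other two names are hypothetical examples.
files <- c("Album Sales.sav", "grades.xlsx", "survey.csv")

# file_ext() returns whatever follows the last period in each name
extensions <- tools::file_ext(files)
print(extensions)
```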

One file we’ll be using today is the Album Sales file. On the Moodle page for the course, at the very top underneath the link for Zoom sessions, you’ll see a folder called SPSS Data Sets. When you click on the folder you’ll see a list of datasets. Click on the one that says “Album Sales.sav” and download it to your computer.

Once the file is downloaded, go to the upper right pane in R Studio. Under the Environment tab you’ll see a button that says “Import Dataset”. Click on the button and you’ll get several different options. Choose the one that says “From SPSS”. At the top of the window that opens, click on the Browse button to find the file you are importing. Once you find it, click Open and a preview of the file will appear in the window. Then all you need to do is click Import.

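Here is the code for importing the file. It assumes “Album Sales.sav” is saved in your working directory; otherwise, supply the full path to the file.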

library(haven)
Album_Sales <- read_sav("Album Sales.sav")
View(Album_Sales)

Descriptive Statistics

Descriptive statistics are just what the name implies: they are used to describe a dataset. This is different from inferential statistics, which are used to infer population parameters based on a sample.

One of the most common sets of descriptive statistics is known as central tendency, which summarizes a dataset as a whole with a single typical value.

Mean

The easiest way to measure central tendency is to find the average or the mean. The formula for the mean is very straightforward. It’s simply the sum of all the scores divided by the number of scores. The symbol for the sample mean is an X with a bar over it.

\[ \bar X=\frac{\Sigma X}{N} \]

The capital Greek letter Sigma in the numerator stands for sum, X stands in for all the values in a dataset, and N stands for the number of values in the dataset. In R Studio, we simply use the mean command to find the mean for a particular variable. The example here uses the carat variable from the diamonds dataset, which ships with the ggplot2 package.

library(ggplot2)  # provides the diamonds dataset
mean(diamonds$carat)
[1] 0.7979397
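To connect the formula to the mean command, here is a small sketch that computes the mean by hand. The vector scores is a made-up example, not course data.

```r
# Hypothetical scores, used only to illustrate the formula
scores <- c(2, 4, 6, 8, 10)

n <- length(scores)             # N: the number of values
manual_mean <- sum(scores) / n  # Sigma X divided by N

manual_mean    # 6
mean(scores)   # 6, the built-in command gives the same answer
```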

Median and Mode

There are two other measures of central tendency that are often used in statistics, median and mode. The median is simply the center score. To find it, rank order your values from lowest to highest and take the middle score in that rank order. For an odd number of scores there is only one middle score. For an even number of scores, add the two middle scores and divide by two. Here again, R gives us an easy way to find the median.

median(diamonds$carat)
[1] 0.7
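The rank-ordering steps can also be carried out by hand, which may help show what the median command is doing. This is a sketch with two made-up vectors, one with an odd and one with an even number of scores.

```r
# Hypothetical scores for illustration
odd_scores  <- c(3, 1, 7, 5, 9)
even_scores <- c(3, 1, 7, 5)

# Odd number of scores: sort, then take the single middle score
sorted_odd <- sort(odd_scores)                          # 1 3 5 7 9
middle_odd <- sorted_odd[(length(sorted_odd) + 1) / 2]  # 5

# Even number of scores: sort, then average the two middle scores
sorted_even <- sort(even_scores)                        # 1 3 5 7
half <- length(sorted_even) / 2
middle_even <- (sorted_even[half] + sorted_even[half + 1]) / 2  # (3 + 5) / 2 = 4

median(odd_scores)    # 5
median(even_scores)   # 4
```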

The mode is the score or value with the highest frequency in your dataset. So whatever value occurs the most times is the mode. Base R has no built-in function for the statistical mode (its mode function reports how an object is stored, not its most frequent value), so you need to install the “modeest” package if you need to find this descriptive statistic.

# install.packages("modeest")  # run once if the package is not yet installed
library(modeest)

Once the package is loaded, use the mfv (most frequent value) command.

mfv(diamonds$carat)
[1] 0.3
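If installing a package is not an option, the most frequent value can also be found with base R by counting how often each value occurs. This is an alternative sketch, not the approach used in the course, and the vector scores is made up. Note that which.max reports only the first value when several are tied.

```r
# Hypothetical scores for illustration
scores <- c(1, 2, 2, 3, 3, 3, 4)

# table() counts how many times each value occurs
counts <- table(scores)

# which.max() picks the position of the largest count (first one if tied)
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value   # 3
```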

We tend to use the mean most often for central tendency, but it is more affected by extreme scores than the median. So if you have several outliers (extreme scores that are much higher or lower than the rest of the data), the median may be a more accurate representation of the data. More often than not, we’ll use the mean. The mode is a more straightforward measure: if you were interested in the most popular song on an album based on downloads, you would find the mode. The mode is often very helpful with categorical data.
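A quick sketch shows how much a single extreme score can move the mean while leaving the median alone. Both vectors are made up for illustration.

```r
# Hypothetical scores, then the same scores plus one extreme value
typical      <- c(10, 12, 11, 13, 12)
with_outlier <- c(typical, 100)

mean(typical)          # 11.6
median(typical)        # 12

mean(with_outlier)     # about 26.3, pulled upward by the outlier
median(with_outlier)   # 12, unchanged
```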

Variability

The second most important aspect of descriptive statistics is variability or dispersion in a dataset. This typically represents how the rest of the dataset relates to the mean or some other measure of central tendency. For example, are the scores clustered close to the mean or widely dispersed from it?

Standard Deviation

Often we’re interested in the average spread or deviation from the mean. Deviance is the distance of any particular score from the mean. So to find the deviance you simply take the score and subtract the mean from it.

\[ deviance = X - \bar X \]

If we wanted to find out the total amount of deviance, we could simply add together the deviance for each number in our dataset. So the equation would look like this:

\[ total\;deviance = \Sigma(X-\bar X) \]

Unfortunately this equation causes a problem. The mean is the balance point of the dataset: scores above the mean have positive deviations, scores below the mean have negative deviations, and the two sets cancel each other out exactly. So if we add up all the deviation scores, the total will always equal 0, and 0 doesn’t tell us anything about the spread of the scores.
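The cancelling of deviation scores is easy to check in R. This is a small sketch with a made-up vector.

```r
# Hypothetical scores for illustration
scores <- c(2, 4, 6, 8, 10)

# Deviance for each score: the score minus the mean
deviations <- scores - mean(scores)
deviations                           # -4 -2  0  2  4

# Negative and positive deviations cancel out
total_deviation <- sum(deviations)
total_deviation                      # 0
```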

The way to overcome this problem is by squaring each deviation score, which makes all the deviation values positive and thus produces a positive total. This number is called the sum of squared errors or the sum of squares (SS), with the following formula.

\[ sum\;of\;squares (SS) = \Sigma(X-\bar X)^2 \]

This total keeps growing with every score that is added, so it is hard to compare across datasets of different sizes. One way to fix this issue is to find the average dispersion by dividing by the number of scores in our sample. Since the sample is used to estimate the population, we actually don’t divide by N, but by N - 1. This number is called the variance and has the following symbol and equation.

\[ variance(s^2) = \frac {SS}{N-1} = \frac {\Sigma(X-\bar X)^2}{N-1} \]
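Here is a sketch of the sum of squares and the variance worked out step by step, compared with R’s var command (which also divides by N - 1). The vector scores is made up for illustration.

```r
# Hypothetical scores for illustration
scores <- c(2, 4, 6, 8, 10)

# Sum of squares: square each deviation, then add them up
ss <- sum((scores - mean(scores))^2)
ss                                             # 16 + 4 + 0 + 4 + 16 = 40

# Variance: divide the sum of squares by N - 1
manual_variance <- ss / (length(scores) - 1)
manual_variance                                # 40 / 4 = 10

var(scores)                                    # 10
```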

The variance is expressed in squared units, so it is not on the same scale as the original measurements. To get back to the original units of measurement, the square root of the variance is calculated. The result is the standard deviation (s).

\[ s = \sqrt {\frac {\Sigma (X- \bar X)^2}{N-1}} \]

To find the standard deviation in R Studio, we use the sd command.

sd(Album_Sales$Adverts)
[1] 485.6552
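To see that sd follows the standard deviation formula, the calculation can be written out in one line and compared with the command. This sketch uses a made-up vector rather than the Album Sales data.

```r
# Hypothetical scores for illustration
scores <- c(2, 4, 6, 8, 10)

# Square root of the sum of squares divided by N - 1
manual_sd <- sqrt(sum((scores - mean(scores))^2) / (length(scores) - 1))
manual_sd     # 3.162278 (the square root of 10)

sd(scores)    # 3.162278
```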

Range

Another measure of dispersion is the range. This is simply the lowest score subtracted from the highest score for a particular variable or vector of scores. So to find the range, use the range function, which returns the lowest and the highest score, and then subtract the first number from the second.

range(Album_Sales$Adverts)
[1]    9.104 2271.860

Then subtract the lowest score from the highest score.

2271.860-9.104
[1] 2262.756
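The subtraction can also be done by R rather than by retyping the numbers. Here is a sketch with a made-up vector, using the base R functions diff, max and min.

```r
# Hypothetical scores for illustration
scores <- c(2, 4, 6, 8, 10)

limits <- range(scores)     # lowest and highest score: 2 10

# diff() subtracts the first number from the second
spread <- diff(limits)
spread                      # 8

max(scores) - min(scores)   # 8, the same result
```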

Interquartile Range

Another helpful type of range is the interquartile range, which is the distance between the 25th and 75th percentiles. Percentiles divide up a dataset based on the percentage of scores that fall at or below a given value. For example, we can look at what score is at the 50th percentile. To do that we use the quantile function (a quantile is a percentile written as a proportion instead of a percentage). x is the variable we are analyzing, and .5 is the percentile (.5 = 50%; .35 = 35%).

quantile(x = diamonds$carat, probs = .5)
50% 
0.7 

So the number at the 50th percentile is 0.7. Notice that this is the same as the median, described earlier. Remember that the median is the number in the middle or center of the rank-ordered dataset. The mean, by contrast, is calculated from all the scores, so it may or may not be a number that actually appears in the dataset.
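A short sketch with a made-up vector illustrates both points: the 50th percentile matches the median, while the mean here is a value that does not appear in the data.

```r
# Hypothetical scores for illustration
scores <- c(1, 2, 6)

# unname() drops the "50%" label so only the number is kept
p50 <- unname(quantile(x = scores, probs = .5))
p50                      # 2
median(scores)           # 2

mean(scores)             # 3
mean(scores) %in% scores # FALSE: 3 is not one of the scores
```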

So to find the interquartile range, we want the two quartiles, or numbers at “quarters” of the dataset, at 25% and 75%. Think of a dollar, which is made up of 4 quarters. We already know that half a dollar is 50 cents, which would be the median, but the other two quartiles (quarters) would be 25 cents and 75 cents.

quantile(x = diamonds$carat, probs = c(.25, .75))
 25%  75% 
0.40 1.04 

Finally, to get the interquartile range we simply subtract the 25th percentile from the 75th percentile.

1.04-.40
[1] 0.64

So the interquartile range is 0.64.
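R also has a built-in IQR command in the stats package that does the subtraction in one step. The sketch below checks it against the two-step approach on a made-up vector. It assumes the default settings of quantile and IQR, which use the same method for locating percentiles.

```r
# Hypothetical scores for illustration
scores <- c(2, 4, 6, 8, 10)

# Two-step approach: find the quartiles, then subtract
quartiles <- quantile(x = scores, probs = c(.25, .75))  # 25% = 4, 75% = 8
manual_iqr <- unname(quartiles[2] - quartiles[1])
manual_iqr      # 4

# One-step approach
IQR(scores)     # 4
```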